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Abstract 



This article describes a new system for induction of oblique decision trees. This system, 
OCl, combines deterministic hill-climbing with two forms of randomization to find a good 
oblique split (in the form of a hyperplane) at each node of a decision tree. Oblique decision 
tree methods are tuned especially for domains in which the attributes are numeric, although 
they can be adapted to symbolic or mixed symbolic/numeric attributes. We present ex- 
tensive empirical studies, using both real and artificial data, that analyze OCl's ability to 
construct oblique trees that are smaller and more accurate than their axis-parallel coun- 
terparts. We also examine the benefits of randomization for the construction of oblique 
decision trees. 

1. Introduction 

Current data collection technology provides a unique challenge and opportunity for auto- 
mated machine learning techniques. The advent of major scientific projects such as the 
Human Genome Project, the Hubble Space Telescope, and the human brain mapping ini- 
tiative are generating enormous amounts of data on a daily basis. These streams of data 
require automated methods to analyze, filter, and classify them before presenting them in 
digested form to a domain scientist. Decision trees are a particularly useful tool in this con- 
text because they perform classification by a sequence of simple, easy-to-understand tests 
whose semantics is intuitively clear to domain experts. Decision trees have been used for 
classification and other tasks since the 1960s (Moret, 1982; Safavin & Landgrebe, 1991). In 
the 1980's, Breiman et al.'s book on classification and regression trees (CART) and Quin- 
lan's work on ID3 (Quinlan, 1983, 1986) provided the foundations for what has become a 
large body of research on one of the central techniques of experimental machine learning. 

Many variants of decision tree (DT) algorithms have been introduced in the last decade. 
Much of this work has concentrated on decision trees in which each node checks the value 
of a single attribute (Breiman, Friedman, Olshen, & Stone, 1984; Quinlan, 1986, 1993a). 
Quinlan initially proposed decision trees for classification in domains with symbolic- valued 
attributes (1986), and later extended them to numeric domains (1987). When the attributes 
are numeric, the tests have the form Xi > k, where Xi is one of the attributes of an example 
and A; is a constant. This class of decision trees may be called axis-parallel, because the tests 
at each node are equivalent to axis-parallel hyperplanes in the attribute space. An example 
of such a decision tree is given in Figure 1, which shows both a tree and the partitioning it 
creates in a 2-D attribute space. 
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Figure 1: The left side of the figure shows a simple axis-parallel tree that uses two attributes. 

The right side shows the partitioning that this tree creates in the attribute space. 



Researchers have also studied decision trees in which the test at a node uses boolean 
combinations of attributes (Pagallo, 1990; Pagallo & Haussler, 1990; Sahami, 1993) and 
linear combinations of attributes (see Section 2). Different methods for measuring the 
goodness of decision tree nodes, as well as techniques for pruning a tree to reduce overfitting 
and increase accuracy have also been explored, and will be discussed in later sections. 

In this paper, we examine decision trees that test a linear combination of the attributes 
at each internal node. More precisely, let an example take the form X = xi, X2, . . . , x^, Cj 
where Cj is a class label and the Sj's are real-valued attributes.^ The test at each node will 
then have the form: 



where ai, . . . , a^+i are real- valued coefficients. Because these tests are equivalent to hy- 
perplanes at an oblique orientation to the axes, we call this class of decision trees oblique 
decision trees. (Trees of this form have also been called "multivariate" (Brodley & Utgoff, 
1994). We prefer the term "oblique" because "multivariate" includes non-linear combina- 
tions of the variables, i.e., curved surfaces. Our trees contain only linear tests.) It is clear 
that these are simply a more general form of axis-parallel trees, since by setting Ui = 
for all coefficients but one, the test in Eq. 1 becomes the familiar univariate test. Note 
that oblique decision trees produce polygonal (polyhedral) partitionings of the attribute 
space, while axis-parallel trees produce partitionings in the form of hyper-rectangles that 
are parallel to the feature axes. 

It should be intuitively clear that when the underlying concept is defined by a polyg- 
onal space partitioning, it is preferable to use oblique decision trees for classification. For 
example, there exist many domains in which one or two oblique hyperplanes will be the 
best model to use for classification. In such domains, axis-parallel methods will have to ap- 

1. The constraint that xi,...,Xd are real- valued does not necessarily restrict oblique decision trees to 
numeric domains. Several researchers have studied the problem of converting symbolic (unordered) 
domains to numeric (ordered) domains and vice versa; e.g., (Breiman et al., 1984; Hampson & Volper, 
1986; Utgoff & Brodley, 1990; Van de Merckt, 1992, 1993). To keep the discussion simple, however, we 
will assume that all attributes have numeric values. 
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Figure 2: The left side shows a simple 2-D domain in which two oblique hyperplanes define 
the classes. The right side shows an approximation of the sort that an axis-parallel 
decision tree would have to create to model this domain. 

proximate the correct model with a staircase-like structure, while an oblique tree-building 
method could capture it with a tree that was both smaller and more accurate.^ Figure 2 
gives an illustration. 

Breiman et al. first suggested a method for inducing oblique decision trees in 1984. How- 
ever, there has been very little further research on such trees until relatively recently (Utgoff 
& Brodley, 1990; Heath, Kasif, & Salzberg, 1993b; Murthy, Kasif, Salzberg, & Beigel, 1993; 
Brodley & Utgoff, 1994). A comparison of existing approaches is given in more detail in 
Section 2. The purpose of this study is to review the strengths and weaknesses of existing 
methods, to design a system that combines some of the strengths and overcomes the weak- 
nesses, and to evaluate that system empirically and analytically. The main contributions 
and conclusions of our study are as follows: 

• We have developed a new, randomized algorithm for inducing oblique decision trees 
from examples. This algorithm extends the original 1984 work of Breiman et al. 
Randomization helps significantly in learning many concepts. 

• Our algorithm is fully implemented as an oblique decision tree induction system and 
is available over the Internet. The code can be retrieved from Online Appendix 1 of 
this paper (or by anonymous ftp from ftp://ftp.cs.jhu.edu/pub/ocl/ocl. tar. Z). 

• The randomized hill-climbing algorithm used in OCX is more efficient than other 
existing randomized oblique decision tree methods (described below). In fact, the 
current implementation of OCX guarantees a worst-case running time that is only 
O(logra) times greater than the worst-case time for inducing axis-parallel trees (i.e., 
O(dn'^logn) vs. 0{dn^)). 

• The ability to generate oblique trees often produces very small trees compared to 
axis-parallel methods. When the underlying problem requires an oblique split, oblique 

2. Note that though a given oblique tree may have fewer leaf nodes than an axis-parallel tree — which is what 
we mean by "smaller" — the oblique tree may in some cases be larger in terms of information content, 
because of the increased complexity of the tests at each node. 
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trees are also more accurate than axis-parallel trees. Allowing a tree-building system 
to use both oblique and axis-parallel splits broadens the range of domains for which 
the system should be useful. 

The remaining sections of the paper follow this outline: the remainder of this section 
briefly outlines the general paradigm of decision tree induction, and discusses the com- 
plexity issues involved in inducing oblique decision trees. Section 2 briefly reviews some 
existing techniques for oblique DT induction, outlines some limitations of each approach, 
and introduces the OCX system. Section 3 describes the OCX system in detail. Section 4 
describes experiments that (X) compare the performance of OCX to that of several other 
axis-parallel and oblique decision tree induction methods on a range of real- world datasets 
and (2) demonstrate empirically that OCX significantly benefits from its randomization 
methods. In Section 5, we conclude with some discussion of open problems and directions 
for further research. 

1.1 Top-Down Induction of Decision Trees 

Algorithms for inducing decision trees follow an approach described by Quinlan as top-down 
induction of decision trees (X986). This can also be called a greedy divide- and- conquer 
method. The basic outline is as follows: 

X. Begin with a set of examples called the training set, T . If all examples in T belong 
to one class, then halt. 

2. Consider all tests that divide T into two or more subsets. Score each test according 
to how well it splits up the examples. 

3. Choose ("greedily") the test that scores the highest. 

4. Divide the examples into subsets and run this procedure recursively on each subset. 

Quinlan's original model only considered attributes with symbolic values; in that model, 
a test at a node splits an attribute into all of its values. Thus a test on an attribute 
with three values will have at most three child nodes, one corresponding to each value. 
The algorithm considers all possible tests and chooses the one that optimizes a pre-defined 
goodness measure. (One could also split symbolic values into two or more subsets of values, 
which gives many more choices for how to split the examples.) As we explain next, oblique 
decision tree methods cannot consider all tests due to complexity considerations. 

1.2 Complexity of Induction of Oblique Decision Trees 

One reason for the relatively few papers on the problem of inducing oblique decision trees is 
the increased computational complexity of the problem when compared to the axis-parallel 
case. There are two important issues that must be addressed. In the context of top-down 
decision tree algorithms, we must address the complexity of finding optimal separating 
hyperplanes (decision surfaces) for a given node of a decision tree. An optimal hyperplane 
will minimize the impurity measure used; e.g., impurity might be measured by the total 
number of examples mis-classified. The second issue is the lower bound on the complexity 
of finding optimal (e.g., smallest size) trees. 
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Figure 3: For n points in d dimensions (n > d), there are n ■ d distinct axis-parallel splits, 
while there are 2''-(^) distinct rf- dimensional oblique splits. This shows all distinct 
oblique and axis-parallel splits for two specific points in 2-D. 

Let us first consider the issue of the complexity of selecting an optimal oblique hyper- 
plane for a single node of a tree. In a domain with n training instances, each described using 
d real-valued attributes, there are at most 2'^ ■ (^) distinct rf- dimensional oblique splits; i.e., 
hyperplanes"^ that divide the training instances uniquely into two nonoverlapping subsets. 
This upper bound derives from the observation that every subset of size d from the n points 
can define a rf- dimensional hyperplane, and each such hyperplane can be rotated slightly 
in 2'^ directions to divide the set of d points in all possible ways. Figure 3 illustrates these 
upper limits for two points in two dimensions. For axis-parallel splits, there are only n ■ d 
distinct possibilities, and axis-parallel methods such as C4.5 (Quinlan, 1993a) and CART 
(Breiman et al., 1984) can exhaustively search for the best split at each node. The problem 
of searching for the best oblique split is therefore much more difficult than that of searching 
for the best axis-parallel split. In fact, the problem is NP-hard. 

More precisely. Heath (1992) proved that the following problem is NP-hard: given a 
set of labelled examples, find the hyperplane that minimizes the number of misclassified 
examples both above and below the hyperplane. This result implies that any method 
for finding the optimal oblique split is likely to have exponential cost (assuming P NP). 
Intuitively, the problem is that it is impractical to enumerate all 2'^ ■ (^) distinct hyperplanes 
and choose the best, as is done in axis-parallel decision trees. However, any non-exhaustive 
deterministic algorithm for searching through all these hyperplanes is prone to getting stuck 
in local minima. 



3. Throughout the paper, we use the terms "split" and "hyperplane" interchangeably to refer to the test 
at a node of a decision tree. The first usage is standard (Moret, 1982), and refers to the fact that the 
test splits the data into two partitions. The second usage refers to the geometric form of the test. 
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On the other hand, it is possible to define impurity measures for which the problem 
of finding optimal hyperplanes can be solved in polynomial time. For example, if one 
minimizes the sum of distances of mis-classified examples, then the optimal solution can 
be found using linear programming methods (if distance is measured along one dimension 
only). However, classifiers are usually judged by how many points they classify correctly, 
regardless of how close to the decision boundary a point may lie. Thus most of the standard 
measures for computing impurity base their calculation on the discrete number of examples 
of each category on either side of the hyperplane. Section 3.3 discusses several commonly 
used impurity measures. 

Now let us address the second issue, that of the complexity of building a small tree. 
It is easy to show that the problem of inducing the smallest axis-parallel decision tree is 
NP-hard. This observation follows directly from the work of Hyafil and Rivest (1976). Note 
that one can generate the smallest axis-parallel tree that is consistent with the training 
set in polynomial time if the number of attributes is a constant. This can be done by 
using dynamic programming or branch and bound techniques (see Moret (1982) for several 
pointers). But when the tree uses oblique splits, it is not clear, even for a fixed number 
of attributes, how to generate an optimal (e.g., smallest) decision tree in polynomial time. 
This suggests that the complexity of constructing good oblique trees is greater than that 
for axis-parallel trees. 

It is also easy to see that the problem of constructing an optimal (e.g., smallest) oblique 
decision tree is NP-hard. This conclusion follows from the work of Blum and Rivest (1988). 
Their result implies that in d dimensions (i.e., with d attributes) the problem of producing 
a 3-node oblique decision tree that is consistent with the training set is NP-complete. More 
specifically, they show that the following decision problem is NP-complete: given a training 
set T with n examples and d Boolean attributes, does there exist a 3-node neural network 
consistent with T? From this it is easy to show that the following question is NP-complete: 
given a training set T, does there exist a 3-leaf-node oblique decision tree consistent with 

r? 

As a result of these complexity considerations, we took the pragmatic approach of trying 
to generate small trees, but not looking for the smallest tree. The greedy approach used by 
OCX and virtually all other decision tree algorithms implicitly tries to generate small trees. 
In addition, it is easy to construct example problems for which the optimal split at a node 
will not lead to the best tree; thus our philosophy as embodied in OCX is to find locally 
good splits, but not to spend excessive computational effort on improving the quality of 
these splits. 

2. Previous Work on Oblique Decision Tree Induction 

Before describing the OCX algorithm, we will briefiy discuss some existing oblique DT 
induction methods, including CART with linear combinations. Linear Machine Decision 
Trees, and Simulated Annealing of Decision Trees. There are also methods that induce 
tree-like classifiers with linear discriminants at each node, most notably methods using 
linear programming (Mangasarian, Setiono, & Wolberg, X990; Bennett & Mangasarian, 
X992, X994a, X994b). Though these methods can find the optimal linear discriminants for 
specific goodness measures, the size of the linear program grows very fast with the number 
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To induce a split at node T of the decision tree: 
Normalize values for all d attributes. 

L = 

While (TRUE) 

L = L+l 

Let the current split sl be v < c, where v = XliLi (^i^i- 
For i = 1, . . . ,d 

For 7 = -0.25,0,0.25 

Search for the 6 that maximizes the goodness of the split v — 6(ai + 7) < c. 

Let 6* ,j* be the settings that result in highest goodness in these 3 searches. 

Gi = Gi — 6* , C = c — 6*J*. 

Perturb c to maximize the goodness of s^, keeping ai, . . . , constant. 

If Igoodness(si) - goodness(si_i)| < e exit while loop. 
Eliminate irrelevant attributes in {ai, . . . , a^} using backward elimination. 
Convert to a split on the un-normalized attributes. 

Return the better of and the best axis-parallel split as the split for T. 

Figure 4: The procedure used by CART with linear combinations (CART-LC) at each node 
of a decision tree. 



of instances and the number of attributes. There is also some less closely related work on 
algorithms to train artificial neural networks to build decision tree-like classifiers (Brent, 
1991; Cios & Liu, 1992; Herman & Yeung, 1992). 

The first oblique decision tree algorithm to be proposed was CART with linear combina- 
tions (Breiman et al., 1984, chapter 5). This algorithm, referred to henceforth as CART-LC, 
is an important basis for OCX. Figure 4 summarizes (using Breiman et al.'s notation) what 
the CART-LC algorithm does at each node in the decision tree. The core idea of the CART- 
LC algorithm is how it finds the value of S that maximizes the goodness of a split. This 
idea is also used in OCX, and is explained in detail in Section 3.X. 

After describing CART-LC, Breiman et al. point out that there is still much room for 
further development of the algorithm. OCX represents an extension of CART-LC that 
includes some significant additions. It addresses the following limitations of CART-LC: 

• CART-LC is fully deterministic. There is no built-in mechanism for escaping local 
minima, although such minima may be very common for some domains. Figure 5 
shows a simple example for which CART-LC gets stuck. 

• CART-LC produces only a single tree for a given data set. 

• CART-LC sometimes makes adjustments that increase the impurity of a split. This 
feature was probably included to allow it to escape some local minima. 

• There is no upper bound on the time spent at any node in the decision tree. It halts 
when no perturbation changes the impurity more than e, but because impurity may 
increase and decrease, the algorithm can spend arbitrarily long time at a node. 
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Figure 5: The deterministic perturbation algorithm of CART-LC fails to find the correct 
split for this data, even when it starts from the location of the best axis-parallel 
split. OCX finds the correct split using one random jump. 

Another oblique decision tree algorithm, one that uses a very different approach from 
CART-LC, is the Linear Machine Decision Trees (LMDT) system (Utgoff & Brodley, 1991; 
Brodley & Utgoff, 1992), which is a successor to the Perceptron Tree method (Utgoff, 1989; 
Utgoff & Brodley, 1990). Each internal node in an LMDT tree is a Linear Machine (Nilsson, 
1990). The training algorithm presents examples repeatedly at each node until the linear 
machine converges. Because convergence cannot be guaranteed, LMDT uses heuristics to 
determine when the node has stabilized. To make the training stable even when the set of 
training instances is not linearly separable, a "thermal training" method (Frean, 1990) is 
used, similar to simulated annealing. 

A third system that creates oblique trees is Simulated Annealing of Decision Trees 
(SADT) (Heath et al., 1993b) which, like OCl, uses randomization. SADT uses simulated 
annealing (Kirkpatrick, Gelatt, & Vecci, 1983) to find good values for the coefficients of 
the hyperplane at each node of a tree. SADT first places a hyperplane in a canonical 
location, and then iteratively perturbs all the coefficients by small random amounts. Ini- 
tially, when the temperature parameter is high, SADT accepts almost any perturbation of 
the hyperplane, regardless of how it changes the goodness score. However, as the system 
"cools down," only changes that improve the goodness of the split are likely to be accepted. 
Though SADT's use of randomization allows it to effectively avoid some local minima, it 
compromises on efficiency. It runs much slower than either CART-LC, LMDT or OCl, 
sometimes considering tens of thousands of hyperplanes at a single node before it finishes 
annealing. 

Our experiments in Section 4.3 include some results showing how all of these methods 
perform on three artificial domains. 

We next describe a way to combine some of the strengths of the methods just mentioned, 
while avoiding some of the problems. Our algorithm, OCl, uses deterministic hill climbing 
most of the time, ensuring computational efficiency. In addition, it uses two kinds of 
randomization to avoid local minima. By limiting the number of random choices, the 
algorithm is guaranteed to spend only polynomial time at each node in the tree. In addition, 
randomization itself has produced several benefits: for example, it means that the algorithm 
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To find a split of a set of examples T: 

Find the best axis-parallel split of T. Let I be the impurity of this split. 
Repeat R times: 

Choose a random hyperplane H . 

(For the first iteration, initialize H to be the best axis-parallel split.) 
Step 1: Until the impurity measure does not improve, do: 

Perturb each of the coefficients of H in sequence. 
Step 2: Repeat at most J times: 

Choose a random direction and attempt to perturb H in that direction. 

If this reduces the impurity of H , go to Step 1. 
Let Ii = the impurity of H. If Ii < I, then set I = Ii. 
Output the split corresponding to I. 

Figure 6: Overview of the OCl algorithm for a single node of a decision tree. 

can produce many different trees for the same data set. This offers the possibility of a new 
family of classifiers: A;-decision-tree algorithms, in which an example is classified by the 
majority vote of k trees. Heath et al. (f993a) have shown that A;-decision tree methods 
(which they call A;-DT) will consistently outperform single tree methods if classification 
accuracy is the main criterion. Finally, our experiments indicate that OCl efficiently finds 
small, accurate decision trees for many different types of classification problems. 

3. Oblique Classifier 1 (OCl) 

In this section we discuss details of the oblique decision tree induction system OCf . As 
part of this description, we include: 

• the method for finding coefficients of a hyperplane at each tree node, 

• methods for computing the impurity or goodness of a hyperplane, 

• a tree pruning strategy, and 

• methods for coping with missing and irrelevant attributes. 

Section 3.f focuses on the most complicated of these algorithmic details; i.e. the question of 
how to find a hyperplane that splits a given set of instances into two reasonably "pure" non- 
overlapping subsets. This randomized perturbation algorithm is the main novel contribution 
of OCf. Figure 6 summarizes the basic OCf algorithm, used at each node of a decision 
tree. This figure will be explained further in the following sections. 

3.1 Perturbation algorithm 

OCf imposes no restrictions on the orientation of the hyperplanes. However, in order to be 
at least as powerful as standard DT methods, it first finds the best axis-parallel (univariate) 
split at a node before looking for an oblique split. OCf uses an oblique split only when it 
improves over the best axis-parallel split.'* 

4. As pointed out in (Breiman et al., 1984, Chapter 5), it does not make sense to use an oblique split when 
the number of examples at a node n is less than or almost equal to the number of features d, because the 
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The search strategy for the space of possible hyperplanes is defined by the procedure 
that perturbs the current hyperplane H to a new location. Because there are an exponential 
number of distinct ways to partition the examples with a hyperplane, any procedure that 
simply enumerates all of them will be unreasonably costly. The two main alternatives 
considered in the past have been simulated annealing, used in the SADT system (Heath 
et al., 1993b), and deterministic heuristic search, as in CART-LC (Breiman et al., 1984). 
OCX combines these two ideas, using heuristic search until it finds a local minimum, and 
then using a non-deterministic search step to get out of the local minimum. (The non- 
deterministic step in OCX is not simulated annealing, however.) 

We will start by explaining how we perturb a hyperplane to split the training set T at 
a node of the decision tree. Let n be the number of examples in T, d be the number of 
attributes (or dimensions) for each example, and k be the number of categories. Then we 
can write Tj = (xji , Xj2, . . . , Xjd, Cj) for the jth example from the training set T, where Xji is 
the value of attribute i and Cj is the category label. As defined in Eq. X, the equation of the 
current hyperplane H at a node of the decision tree is written as J2i=ii'^i^i)~^'^d+i = 0. If we 
substitute a point (an example) Tj into the equation for H , we get J2i=ii'^i^ ji) ~^ (^d+i = Vj, 
where the sign of Vj tells us whether the point Tj is above or below the hyperplane H; 
i.e., if Vj > 0, then Tj is above H . If H splits the training set T perfectly, then all points 
belonging to the same category will have the same sign for Vj. i.e., sign(T^) = sign(T^) iff 
category(r8) = category(rj). 

OCX adjusts the coefficients of H individually, finding a locally optimal value for one 
coefficient at a time. This key idea was introduced by Breiman et al. It works as follows. 
Treat the coefficient variable, and treat all other coefficients as constants. Then 

Vj can be viewed as a function of a^- In particular, the condition that Tj is above H is 
equivalent to 

>0 

a„ > ~ = Uj (2) 

assuming that Xjm > 0, which we ensure by normalization. Using this definition of Uj, the 
point Tj is above H if > Uj, and below otherwise. By plugging all the points from T 
into this equation, we will obtain n constraints on the value of a^- 

The problem then is to find a value for that satisfies as many of these constraints 
as possible. (If all the constraints are satisfied, then we have a perfect split.) This problem 
is easy to solve optimally: simply sort all the values Uj, and consider setting to the 
midpoint between each pair of different values. This is illustrated in Figure 7. In the figure, 
the categories are indicated by font size; the larger C/j's belong to one category, and the 
smaller to another. For each distinct placement of the coefficient a^, OCX computes the 
impurity of the resulting split; e.g., for the location between Uq and Ur illustrated here, two 
examples on the left and one example on the right would be misclassified (see Section 3.3. X 
for different ways of computing impurity). As the figure illustrates, the problem is simply 
to find the best one- dimensional split of the Us, which requires considering just n—1 values 
for Urn- The value a'^ obtained by solving this one- dimensional problem is then considered 



data underfits the concept. By default, OCl uses only axis-parallel splits at tree nodes at which n < 2d. 
The user can vary this threshold. 
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Figure 7: Finding the optimal value for a single coefficient a^- Large U's correspond to 
examples in one category and small u's to another. 

Perturb (H,m) 
For j = I, . . . , n 

Compute Uj (Eq. 2) 
Sort Ui, . . . ,Un in non- decreasing order. 

= best univariate split of the sorted UjS. 
Hi = result of substituting for a™ in H . 
If (impurity(Hi) < impurity(H)) 

{ — ? Pfnove — Psiag } 

Else if (impurity(H) = impurity(Hi)) 
{ am = a'^ with probability Pmove 

Pfnove — Pmove 0.1 * Psiag } 

Figure 8: Perturbation algorithm for a single coefficient a.^. 

as a replacement for a.^. Let Hi be the hyperplane obtained by "perturbing" a.^ to a[^. If 
H has better (lower) impurity than Hi, then Hi is discarded. If Hi has lower impurity, Hi 
becomes the new location of the hyperplane. If H and Hi have identical impurities, then 
Hi replaces H with probability Pstag-^ Figure 8 contains pseudocode for our perturbation 
procedure. 

Now that we have a method for locally improving a coefficient of a hyperplane, we need 
to decide which of the d -\- 1 coefficients to pick for perturbation. We experimented with 
three different methods for choosing which coefficient to adjust, namely, sequential, best 
first and random. 

Seq: Repeat until none of the coefficient values is modified in the For loop: 

For i = 1 to (i, Perturb(i7, i) 
Best: Repeat until coefficient m remains unmodified: 

m = coefficient which when perturbed, results in the 
maximum improvement of the impurity measure. 

Perturb(i7, m) 

R-50: Repeat a fixed number of times (50 in our experiments): 
m = random integer between 1 and d -\- 1 
Perturb(i7, m) 

5. The parameter Pstag, denoting "stagnation probability", is the probability that a hyperplane is perturbed 
to a location that does not change the impurity measure. To prevent the impurity from remaining 
stagnant for a long time, Pstag decreases by a constant amount each time OCl makes a "stagnant" 
perturbation; thus only a constant number of such perturbations will occur at each node. This constant 
can be set by the user. Pstag is reset to 1 every time the global impurity measure is improved. 
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Our previous experiments (Murthy et al., 1993) indicated that the order of perturbation 
of the coefficients does not affect the classification accuracy as much as other parameters, 
especially the randomization parameters (see below). Since none of these orders was uni- 
formly better than any other, we used sequential (Seq) perturbation for all the experiments 
reported in Section 4. 

3.2 Randomization 

The perturbation algorithm halts when the split reaches a local minimum of the impurity 
measure. For OCl's search space, a local minimum occurs when no perturbation of any 
single coefficient of the current hyperplane will decrease the impurity measure. (Of course, 
a local minimum may also be a global minimum.) We have implemented two ways of 
attempting to escape local minima: perturbing the hyperplane with a random vector, and 
re-starting the perturbation algorithm with a different random initial hyperplane. 

The technique of perturbing the hyperplane with a random vector works as follows. 
When the system reaches a local minimum, it chooses a random vector to add to the 
coefficients of the current hyperplane. It then computes the optimal amount by which the 
hyperplane should be perturbed along this random direction. To be more precise, when 
a hyperplane H = J2i=i (^i^i + ^d+i cannot be improved by deterministic perturbation, 
OCl repeats the following loop / times (where / is a user-specified parameter, set to 5 by 
default). 

• Choose a random vector R = (ri, r2, . . . , rd+i). 

• Let a be the amount by which we want to perturb H in the direction R. In other 
words, let Hi = J2t=i + ari)x, + (ud+i + ard+i). 

• Find the optimal value for a. 

• If the hyperplane Hi thus obtained decreases the overall impurity, replace H with Hi, 
exit this loop and begin the deterministic perturbation algorithm for the individual 
coefficients. 

Note that we can treat a as the only variable in the equation for Hi . Therefore each of the 
n examples in T, if plugged into the equation for Hi, imposes a constraint on the value of 
a. OCl therefore can use its coefficient perturbation method (see Section 3.1) to compute 
the best value of a. If / random jumps fail to improve the impurity, OCl halts and uses 
H as the split for the current tree node. 

An intuitive way of understanding this random jump is to look at the dual space in which 
the algorithm is actually searching. Note that the equation H = J2i=i '^i^i + o^d+i defines 
a space in which the axes are the coefficients rather than the attributes Xi. Every point 
in this space defines a distinct hyperplane in the original formulation. The deterministic 
algorithm used in OCl picks a hyperplane and then adjusts coefficients one at a time. Thus 
in the dual space, OCl chooses a point and perturbs it by moving it parallel to the axes. 
The random vector R represents a random direction in this space. By finding the best value 
for a, OCl finds the best distance to adjust the hyperplane in the direction of R. 
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Note that this additional perturbation in a random direction does not significantly in- 
crease the time complexity of the algorithm (see Appendix A). We found in our experiments 
that even a single random jump, when used at a local minimum, proves to be very helpful. 
Classification accuracy improved for every one of our data sets when such perturbations 
were made. See Section 4.3 for some examples. 

The second technique for avoiding local minima is a variation on the idea of performing 
multiple local searches. The technique of multiple local searches is a natural extension 
to local search, and has been widely mentioned in the optimization literature (see Roth 
(1970) for an early example). Because most of the steps of our perturbation algorithm 
are deterministic, the initial hyperplane largely determines which local minimum will be 
encountered first. Perturbing a single initial hyperplane is thus unlikely to lead to the best 
split of a given data set. In cases where the random perturbation method fails to escape 
from local minima, it may be helpful to simply start afresh with a new initial hyperplane. 
We use the word restart to denote one run of the perturbation algorithms, at one node of 
the decision tree, using one random initial hyperplane.^ That is, a restart cycles through 
and perturbs the coefficients one at a time and then tries to perturb the hyperplane in a 
random direction when the algorithm reaches a local minimum. If this last perturbation 
reduces the impurity, the algorithm goes back to perturbing the coefficients one at a time. 
The restart ends when neither the deterministic local search nor the random jump can find 
a better split. One of the optional parameters to OCX specifies how many restarts to use. 
If more than one restart is used, then the best hyperplane found thus far is always saved. 
In all our experiments, the classification accuracies increased with more than one restart. 
Accuracy tended to increase up to a point and then level off (after about 20-50 restarts, 
depending on the domain). Overall, the use of multiple initial hyperplanes substantially 
improved the quality of the decision trees found (see Section 4.3 for some examples). 

By carefully combining hill-climbing and randomization, OCX ensures a worst case time 
of O(dn^logn) for inducing a decision tree. See Appendix A for a derivation of this upper 
bound. 



Best Axis-Parallel Split. It is clear that axis-parallel splits are more suitable for some 
data distributions than oblique splits. To take into account such distributions, OCX com- 
putes the best axis-parallel split and an oblique split at each node, and then picks the better 
of the two.'' Calculating the best axis-parallel split takes an additional O(dnlogn) time, 
and so does not increase the asymptotic time complexity of OCX. As a simple variant of 
the OCX system, the user can opt to "switch off" the oblique perturbations, thus building 
an axis-parallel tree on the training data. Section 4.2 empirically demonstrates that this 
axis-parallel variant of OCX compares favorably with existing axis-parallel algorithms. 



6. The first run through the algorithm at each node always begins at the location of the best axis-parallel 
hyperplane; all subsequent restarts begin at random locations. 

7. Sometimes a simple axis-parallel split is preferable to an oblique split, even if the oblique split has slightly 
lower impurity. The user can specify such a bias as an input parameter to OCl. 



13 



MURTHY, KASIF & SaLZBERG 



3.3 Other Details 

3.3.1 Impurity Measures 

OCl attempts to divide the rf- dimensional attribute space into homogeneous regions; i.e., 
regions that contain examples from just one category. The goal of adding new nodes to 
a tree is to split up the sample space so as to minimize the "impurity" of the training 
set. Some algorithms measure "goodness" instead of impurity, the difference being that 
goodness values should be maximized while impurity should be minimized. Many different 
measures of impurity have been studied (Breiman et al., 1984; Quinlan, 1986; Mingers, 
1989b; Buntine & Niblett, 1992; Fayyad & Irani, 1992; Heath et al., 1993b). 

The OCl system is designed to work with a large class of impurity measures. Stated 
simply, if the impurity measure uses only the counts of examples belonging to every category 
on both sides of a split, then OCl can use it. (See Murthy and Salzberg (1994) for ways of 
mapping other kinds of impurity measures to this class of impurity measures.) The user can 
plug in any impurity measure that fits this description. The OCl implementation includes 
six impurity measures, namely: 

1. Information Gain 

2. The Gini Index 

3. The Twoing Rule 

4. Max Minority 

5. Sum Minority 

6. Sum of Variances 

Though all six of the measures have been defined elsewhere in the literature, in some 
cases we have made slight modifications that are defined precisely in Appendix B. Our 
experiments indicated that, on average. Information Gain, Gini Index and the Twoing Rule 
perform better than the other three measures for both axis-parallel and oblique trees. The 
Twoing Rule is the current default impurity measure for OCl, and it was used in all of 
the experiments reported in Section 4. There are, however, artificial data sets for which 
Sum Minority and/or Max Minority perform much better than the rest of the measures. 
For instance. Sum Minority easily induces the exact tree for the POL data set described in 
Section 4.3.1, while all other methods have difficulty finding the best tree. 

Twoing Rule. The Twoing Rule was first proposed by Breiman et al. (1984). The value 
to be computed is defined as: 

k 

TwoingValue = (iTij/n) * (iT^I/n) * (^ \UI\Tl\ - R^|\TR\\f 

8 = 1 

where \Tl\ {\Tr\) is the number of examples on the left (right) of a split at node T, n is 
the number of examples at node T, and Li (Ri) is the number of examples in category i on 
the left (right) of the split. The TwoingValue is actually a goodness measure rather than 
an impurity measure. Therefore OCl attempts to minimize the reciprocal of this value. 
The remaining five impurity measures implemented in OCl are defined in Appendix B. 
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3.3.2 Pruning 

Virtually all decision tree induction systems prune the trees they create in order to avoid 
overfitting the data. Many studies have found that judicious pruning results in both smaller 
and more accurate classifiers, for decision trees as well as other types of machine learning 
systems (Quinlan, 1987; Niblett, 1986; Cestnik, Kononenko, & Bratko, 1987; Kodratoff 
& Manago, 1987; Cohen, 1993; Hassibi & Stork, 1993; Wolpert, 1992; Schaffer, 1993). 
For the OCl system we implemented an existing pruning method, but note that any tree 
pruning method will work fine within OCl. Based on the experimental evaluations of 
Mingers (1989a) and other work cited above, we chose Breiman et al.'s Cost Complexity 
(CC) pruning (1984) as the default pruning method for OCl. This method, which is also 
called Error Complexity or Weakest Link pruning, requires a separate pruning set. The 
pruning set can be a randomly chosen subset of the training set, or it can be approximated 
using cross validation. OCl randomly chooses 10% (the default value) of the training data 
to use for pruning. In the experiments reported below, we only used this default value. 

Briefiy, the idea behind CC pruning is to create a set of trees of decreasing size from the 
original, complete tree. All these trees are used to classify the pruning set, and accuracy is 
estimated from that. CC pruning then chooses the smallest tree whose accuracy is within k 
standard errors squared of the best accuracy obtained. When the 0-SE rule (k = 0) is used, 
the tree with highest accuracy on the pruning set is selected. When A; > 0, smaller tree size 
is preferred over higher accuracy. For details of Cost Complexity pruning, see Breiman et 
al. (1984) or Mingers (1989a). 

3.3.3 Irrelevant attributes 

Irrelevant attributes pose a significant problem for most machine learning methods (Breiman 
et al., 1984; Aha, 1990; Almuallin & Dietterich, 1991; Kira & Rendell, 1992; Salzberg, 1992; 
Cardie, 1993; Schlimmer, 1993; Langley & Sage, 1993; Brodley & Utgoff, 1994). Decision 
tree algorithms, even axis-parallel ones, can be confused by too many irrelevant attributes. 
Because oblique decision trees learn the coefficients of each attribute at a DT node, one 
might hope that the values chosen for each coefficient would refiect the relative importance 
of the corresponding attributes. Clearly, though, the process of searching for good coefficient 
values will be much more efficient when there are fewer attributes; the search space is much 
smaller. For this reason, oblique DT induction methods can benefit substantially by using a 
feature selection method (an algorithm that selects a subset of the original attribute set) in 
conjunction with the coefficient learning algorithm (Breiman et al., 1984; Brodley & Utgoff, 
1994). 

Currently, OCl does not have a built-in mechanism to select relevant attributes. How- 
ever, it is easy to include any of several standard methods (e.g., stepwise forward selection 
or stepwise backward selection) or even an ad hoc method to select features before running 
the tree-building process. For example, in separate experiments on data from the Hubble 
Space Telescope (Salzberg, Chandar, Ford, Murthy, & White, 1994), we used feature selec- 
tion methods as a preprocessing step to OCl, and reduced the number of attributes from 20 
to 2. The resulting decision trees were both simpler and more accurate. Work is currently 
underway to incorporate an efficient feature selection technique into the OCl system. 
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Regarding missing values, if an example is missing a value for any attribute, OCX uses 
the mean value for that attribute. One can of course use other techniques for handling 
missing values, but those were not considered in this study. 

4. Experiments 

In this section, we present two sets of experiments to support the following two claims. 

1. OCX compares favorably over a variety of real- world domains with several existing 
axis-parallel and oblique decision tree induction methods. 

2. Randomization, both in the form of multiple local searches and random jumps, im- 
proves the quality of decision trees produced by OCX. 

The experimental method used for all the experiments is described in Section 4.X. Sec- 
tions 4.2 and 4.3 describe experiments corresponding to the above two claims. Each experi- 
mental section begins with a description of the data sets, and then presents the experimental 
results and discussion. 

4.1 Experimental Method 

We used five-fold cross validation (CV) in all our experiments to estimate classification 
accuracy. A A;-fold CV experiment consists of the following steps. 

X. Randomly divide the data into k equal-sized disjoint partitions. 

2. For each partition, build a decision tree using all data outside the partition, and test 
the tree on the data in the partition. 

3. Sum the number of correct classifications of the k trees and divide by the total number 

of instances to compute the classification accuracy. Report this accuracy and the 
average size of the k trees. 

Each entry in Tables X and 2 is a result of ten 5-fold CV experiments; i.e., the result of tests 
that used 50 decision trees. Each of the ten 5-fold cross validations used a different random 
partitioning of the data. Each entry in the tables reports the mean and standard deviation 
of the classification accuracy, followed by the mean and standard deviation of the decision 
tree size (measured as the number of leaf nodes). Good results should have high values for 
accuracy, low values for tree size, and small standard deviations. 

In addition to OCX, we also included in the experiments an axis-parallel version of OCX, 
which only considers axis-parallel hyperplanes. We call this version, described in Section 3.2, 
OCX-AP. In all our experiments, both OCX and OCX-AP used the Twoing Rule (Section 
3.3.x) to measure impurity. Other parameters to OCX took their default values unless stated 
otherwise. (Defaults include the following: number of restarts at each node: 20. Number 
of random jumps attempted at each local minimum: 5. Order of coefficient perturbation: 
Sequential. Pruning method: Cost Complexity with the 0-SE rule, using X0% of the training 
set exclusively for pruning.) 

In our comparison, we used the oblique version of the CART algorithm, CART-LC. 
We implemented our own version of CART-LC, following the description in Breiman et 
al. (X984, Chapter 5); however, there may be differences between our version and other 
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versions of this system (note that CART-LC is not freely available). Our implementation 
of CART-LC measured impurity with the Twoing Rule and used 0-SE Cost Complexity 
pruning with a separate test set, just as OCX does. We did not include any feature selection 
methods in CART-LC or in OCX, and we did not implement normalization. Because the 
CART coefficient perturbation algorithm may alternate indefinitely between two locations 
of a hyperplane (see Section 2), we imposed an arbitrary limit of XOO such perturbations 
before forcing the perturbation algorithm to halt. 

We also included axis-parallel CART and C4.5 in our comparisons. We used the im- 
plementations of these algorithms from the IND 2.X package (Buntine, X992). The default 
cartO and c4.5 "styles" defined in the package were used, without altering any parameter 
settings. The cartO style uses the Twoing Rule and 0-SE cost complexity pruning with 
XO-fold cross validation. The pruning method, impurity measure and other defaults of the 
c4.5 style are the same as those described in Quinlan (X993a). 

4.2 OCl vs. Other Decision Tree Induction Methods 

Table X compares the performance of OCX to three well-known decision tree induction 
methods plus OCX-AP on six different real-world data sets. In the next section we will 
consider artificial data, for which the concept definition can be precisely characterized. 

4.2.x Description of Data Sets 

Star/Galaxy Discrimination. Two of our data sets came from a large set of astronom- 
ical images collected by Odewahn et al. (Odewahn, Stockwell, Pennington, Humphreys, & 
Zumach, X992). In their study, they used these images to train artificial neural networks 
running the perceptron and back propagation algorithms. The goal was to classify each ex- 
ample as either "star" or "galaxy." Each image is characterized by X4 real- valued attributes, 
where the attributes were measurements defined by astronomers as likely to be relevant for 
this task. The objects in the image were divided by Odewahn et al. into "bright" and "dim" 
data sets based on the image intensity values, where the dim images are inherently more 
difficult to classify. (Note that the "bright" objects are only bright in relation to others 
in this data set. In actuality they are extremely faint, visible only to the most powerful 
telescopes.) The bright set contains 2462 objects and the dim set contains 4X92 objects. 

In addition to the results reported in Table X, the following results have appeared on 
the Star/Galaxy data. Odewahn et al. (X992) reported accuracy of 99.8% accuracy on the 
bright objects, and 92.0% on the dim ones, although it should be noted that this study 
used a single training and test set partition. Heath (X992) reported 99.0% accuracy on the 
bright objects using SADT, with an average tree size of 7.03 leaves. This study also used 
a single training and test set. Salzberg (X992) reported accuracies of 98.8% on the bright 
objects, and 95. X% on the dim objects, using X-Nearest Neighbor (X-NN) coupled with a 
feature selection method that reduces the number of features. 

Breast Cancer Diagnosis. Mangasarian and Bennett have compiled data on the prob- 
lem of diagnosing breast cancer to test several new classification methods (Mangasarian 
et al., X990; Bennett & Mangasarian, X992, X994a). This data represents a set of patients 
with breast cancer, where each patient was characterized by nine numeric attributes plus 
the diagnosis of the tumor as benign or malignant. The data set currently has 683 entries 
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CART-LC 


98.8±0.2 


92.8±0.5 


95.3±0.6 


93.5±2.9 


81.4±1.2 


73.7±1.2 




3.9±1.3 


24.2±8.7 


3.5±0.9 


3.2±0.3 


5.8±3.2 


8.0±5.2 


OCl-AP 


98.1±0.2 


94.0±0.2 


94.5±0.5 


92.7±2.4 


81.8±1.0 


73.8±1.0 




6.9±2.4 


29.3±8.8 


6.4±1.7 


3.2±0.3 


8.6±4.5 


11.4±7.5 


CART-AP 


98.5±0.5 


94.2±0.7 


95.0±1.6 


93.8±3.7 


82.1±3.5 


73.9±3.4 




13.9±5.7 


30.4±10 


11.5±7.2 


4.3±1.6 


15.1±10 


11.5±9.1 


C4.5 


98.5±0.5 


93.3±0.8 


95.3±2.0 


95.1±3.2 


83.2±3.1 


71.4±3.3 




14.3±2.2 


77.9±7.4 


9.8±2.2 


4.6±0.8 


28.2±3.3 


56.3±7.9 



Table 1: Comparison of OCl and other decision tree induction methods on six different 
data sets. The first line for each method gives accuracies, and the second line gives 
average tree sizes. The highest accuracy for each domain appears in boldface. 



and is available from the UC Irvine machine learning repository (Murphy & Aha, 1994). 
Heath et al. (1993b) reported 94.9% accuracy on a subset of this data set (it then had 
only 470 instances), with an average decision tree size of 4.6 nodes, using SADT. Salzberg 
(1991) reported 96.0% accuracy using 1-NN on the same (smaller) data set. Herman and 
Yeung (1992) reported 99.0% accuracy using piece-wise linear classification, again using a 
somewhat smaller data set. 

Classifying Irises. This is Fisher's famous iris data, which has been extensively studied 
in the statistics and machine learning literature. The data consists of 150 examples, where 
each example is described by four numeric attributes. There are 50 examples of each of 
three different types of iris fiower. Weiss and Kapouleas (1989) obtained accuracies of 96.7% 
and 96.0% on this data with back propagation and 1-NN, respectively. 

Housing Costs in Boston. This data set, also available as a part of the UCI ML repos- 
itory, describes housing values in the suburbs of Boston as a function of 12 continuous 
attributes and 1 binary attribute (Harrison & Rubinfeld, 1978). The category variable (me- 
dian value of owner-occupied homes) is actually continuous, but we discretized it so that 
category = 1 if value < $21000, and 2 otherwise. For other uses of this data, see (Belsley, 
1980; Quinlan, 1993b). 

Diabetes diagnosis. This data catalogs the presence or absence of diabetes among Pima 
Indian females, 21 years or older, as a function of eight numeric- valued attributes. The 
original source of the data is the National Institute of Diabetes and Digestive and Kidney 
Diseases, and it is now available in the UCI repository. Smith et al. (1988) reported 76% 
accuracy on this data using their ADAP learning algorithm, using a different experimental 
method from that used here. 
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4.2.2 Discussion 

The table shows that, for the six data sets considered here, OCl consistently finds better 
trees than the original oblique CART method. Its accuracy was greater in all six domains, 
although the difference was significant (more than 2 standard deviations) only for the dim 
star/galaxy problem. The average tree sizes were roughly equal for five of the six domains, 
and for the dim stars and galaxies, OCl found considerably smaller trees. These differences 
will be analyzed and quantified further by using artificial data, in the following section. 

Out of the five decision tree induction methods, OCl has the highest accuracy on four 
of the six domains: bright stars, dim stars, cancer diagnosis, and diabetes diagnosis. On the 
remaining two domains, OCl has the second highest accuracy in each case. Not surprisingly, 
the oblique methods (OCl and CART-LC) generally find much smaller trees than the axis- 
parallel methods. This difference can be quite striking for some domains — note, for example, 
that OCl produced a tree with just 13 nodes on average for the dim star/galaxy problem, 
while C4.5 produced a tree with 78 nodes, 6 times larger. Of course, in domains for which 
an axis-parallel tree is the appropriate representation, axis-parallel methods should compare 
well with oblique methods in terms of tree size. In fact, for the Iris data, all the methods 
found similar-sized trees. 

4.3 Randomization Helps OCl 

In our second set of experiments, we examine more closely the effect of introducing random- 
ized steps into the algorithm for finding oblique splits. Our experiments demonstrate that 
OCl's ability to produce an accurate tree from a set of training data is clearly enhanced 
by the two kinds of randomization it uses. More precisely, we use three artificial data sets 
(for which the underlying concept is known to the experimenters) to show that OCl's per- 
formance improves substantially when the deterministic hill climbing is augmented in any 
of three ways: 

• with multiple restarts from random initial locations, 

• with perturbations in random directions at local minima, or 

• with both of the above randomization steps. 

In order to find clear differences between algorithms, one needs to know that the concept 
underlying the data is indeed difficult to learn. For simple concepts (say, two linearly 
separable classes in 2-D), many different learning algorithms will produce very accurate 
classifiers, and therefore the advantages of randomization may not be detectable. It is 
known that many of the commonly-used data sets from the UCI repository are easy to 
learn with very simple representations (Holte, 1993); therefore those data sets may not be 
ideal for our purposes. Thus we created a number of artificial data sets that present different 
problems for learning, and for which we know the "correct" concept definition. This allows 
us to quantify more precisely how the parameters of our algorithm affect its performance. 

A second purpose of this experiment is to compare OCl's search strategy with that 
of two existing oblique decision tree induction systems - LMDT (Brodley & Utgoff, 1992) 
and SADT (Heath et al., 1993b). We show that the quality of trees induced by OCl is as 
good as, if not better than, that of the trees induced by these existing systems on three 
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artificial domains. We also show that OCl achieves a good balance between amount of 
effort expended in search and the quality of the tree induced. 

Both LMDT and SADT used information gain for this experiment. However, we did 
not change OCf 's default measure (the Twoing Rule) because we observed, in experiments 
not reported here, that OCl with information gain does not produce significantly different 
results. The maximum number of successive, unproductive perturbations allowed at any 
node was set at 10000 for SADT. For all other parameters, we used default settings provided 
with the systems. 

4.3.1 Description of Artificial Data 

LSIO The LSIO data set has 2000 instances divided into two categories. Each instance is 
described by ten attributes xi,. . . ,xio, whose values are uniformly distributed in the range 
[0,1]. The data is linearly separable with a 10-D hyperplane (thus the name LSIO) defined 
by the equation xi + X2 + + X4 + < xq + xi + xg + xg + xiq. The instances were all 
generated randomly and labelled according to which side of this hyperplane they fell on. 
Because oblique DT induction methods intuitively should prefer a linear separator if one 
exists, it is interesting to compare the various search techniques on this data set where we 
know a separator exists. The task is relatively simple for lower dimensions, so we chose 
10- dimensional data to make it more difficult. 

POL This data set is shown in Figure 9. It has 2000 instances in two dimensions, again 
divided into two categories. The underlying concept is a set of four parallel oblique lines 
(thus the name POL), dividing the instances into five homogeneous regions. This concept is 
more difficult to learn than a single linear separator, but the minimal-size tree is still quite 
small. 

RGB RGB stands for "rotated checker board"; this data set has been the subject of 
other experiments on hard classification problems for decision trees (Murthy & Salzberg, 
1994). The data set, shown in Figure 9, has 2000 instances in 2-D, each belonging to one of 
eight categories. This concept is difficult to learn for any axis-parallel method, for obvious 
reasons. It is also quite difficult for oblique methods, for several reasons. The biggest 
problem is that the "correct" root node, as shown in the figure, does not separate out any 
class by itself. Some impurity measures (such as Sum Minority) will fail miserably on this 
problem, although others (e.g., the Twoing Rule) work much better. Another problem is 
that a deterministic coefficient perturbation algorithm can get stuck in local minima in 
many places on this data set. 

Table 2 summarizes the results of this experiment in three smaller tables, one for each 
data set. In each smaller table, we compare four variants of OCl with LMDT and SADT. 
The different results for OCl were obtained by varying both the number of restarts and the 
number of random jumps. When random jumps were used, up to twenty random jumps 
were tried at each local minimum. As soon as one was found that improved the impurity 
of the current hyperplane, the algorithm moved the hyperplane and started running the 
deterministic perturbation procedure again. If none of the 20 random jumps improved the 
impurity, the search halted and further restarts (if any) were tried. The same training and 
test partitions were used for all methods for each cross-validation run (recall that the results 
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Figure 9: The POL and RGB data sets 



Linearly Separable 10-D (LSIO) data 


R:J 


Accuracy 


Size 


Hyperplanes 


0:0 


89.8±1.2 


67.0±5.8 


2756 


0:20 


91.5±1.5 


55.2±7.0 


3824 


20:0 


95.0±0.6 


25.6±2.4 


24913 


20:20 


97.2±0.7 


13.9±3.2 


30366 


LMDT 


99.7±0.2 


2.2±0.5 


9089 


SADT 


95.2±1.8 


15.5±5.7 


349067 


Parallel Oblique Lines (POL) data 


R:J 


Accuracy 


Size 


Hyperplanes 


0:0 


98.3±0.3 


21.6±1.9 


164 


0:20 


99.3±0.2 


9.0±1.0 


360 


20:0 


99.1±0.2 


14.2±1.1 


3230 


20:20 


99.6±0.1 


5.5±0.3 


4852 


LMDT 


89.6±10.2 


41.9±19.2 


1732 


SADT 


99.3±0.4 


8.4±2.1 


85594 


Rotated Checker Board (RGB) data 


R:J 


Accuracy 


Size 


Hyperplanes 


0:0 


98.4±0.2 


35.5±1.4 


573 


0:20 


99.3±0.3 


19.7±0.8 


1778 


20:0 


99.6±0.2 


12.0±1.4 


6436 


20:20 


99.8±0.1 


8.7±0.4 


11634 


LMDT 


95.7±2.3 


70.1±9.6 


2451 


SADT 


97.9±1.1 


32.5±4.9 


359112 



Table 2: The elFect of randomization in OCl. The first column, labelled R:J, shows the 
number of restarts (R) followed by the maximum number of random jumps (J) 
attempted by OCl at each local minimum. Results with LMDT and SADT are 
included for comparison after the four variants of OCl. Size is average tree size 
measured by the number of leaf nodes. The third column shows the average 
number of hyperplanes each algorithm considered while building one tree. 
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are an average of ten 5-fold CVs). The trees were not pruned for any of the algorithms, 
because the data were noise-free and furthermore the emphasis was on search. 

Table 2 also includes the number of hyperplanes considered by each algorithm while 
building a complete tree. Note that for OCX and SADT, the number of hyperplanes con- 
sidered is generally much larger than the number of perturbations actually made, because 
both these algorithms compare newly generated hyperplanes to existing hyperplanes before 
adjusting an existing one. Nevertheless, this number is a good estimate of much effort 
each algorithm expends, because every new hyperplane must be evaluated according to the 
impurity measure. For LMDT, the number of hyperplanes considered is identical to the 
actual number of perturbations. 

4.3.2 Discussion 

The OCX results here are quite clear. The first line of each table, labelled 0:0, gives the 
accuracies and tree sizes when no randomization is used — this variant is very similar 
to the CART-LC algorithm. As we increase the use of randomization, accuracy increases 
while tree size decreases, which is exactly the result we had hoped for when we decided to 
introduce randomization into the method. 

Looking more closely at the tables, we can ask about the effect of random jumps alone. 
This is illustrated in the second line (0:20) of each table, which attempted up to 20 random 
jumps at each local minimum and no restarts. Accuracy increased by X-2% on each domain, 
and tree size decreased dramatically, roughly by a factor of two, in the POL and RGB 
domains. Note that because there is no noise in these domains, very high accuracies should 
be expected. Thus increases of more than a few percent in accuracy are not possible. 

Looking at the third line of each sub-table in Table 2, we see the effect of multiple restarts 
on OCX. With 20 restarts but no random jumps to escape local minima, the improvement 
is even more noticeable for the LSXO data than when random jumps alone were used. For 
this data set, accuracy jumped significantly, from 89.8 to 95.0%, while tree size dropped 
from 67 to 26 nodes. For the POL and RGB data, the improvements were comparable to 
those obtained with random jumps. For the RGB data, tree size dropped by a factor of 3 
(from 36 leaf nodes to X2 leaf nodes) while accuracy increased from 98.4 to 99.6%. 

The fourth line of each table shows the effect of both the randomized steps. Among the 
OGX entries, this line has both the highest accuracies and the smallest trees for all three 
data sets, so it is clear that randomization is a big win for these kinds of problems. In 
addition, note that the smallest tree for the RGB data should have eight leaf nodes, and 
OGX's average trees, without pruning, had just 8.7 leaf nodes. It is clear that for this data 
set, which we thought was the most difficult one, OGX came very close to finding the optimal 
tree on nearly every run. (Recall that numbers in the table are the average of XO 5-fold 
GV experiments; i.e., an average of 50 decision trees.) The LSXO data show how difficult it 
can be to find a very simple concept in higher dimensions — the optimal tree there is just a 
single hyperplane (two nodes), but OGX was unable to find it with the current parameter 
settings.® The POL data required a minimum of 5 leaf nodes, and OGX found this minimal- 
size tree most of the time, as can be seen from the table. Although not shown in the Table, 

8. In a separate experiment, we found that OCl consistently finds the linear separator for the LSfO data 
when 10 restarts and 200 random jumps are used. 
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OCl using Sum Minority performed better for the POL data than the Twoing Rule or any 
other impurity measure; i.e., it found the correct tree using less time. 

The results of LMDT and SADT on this data lead to some interesting insights. Not 
surprisingly, LMDT does very well on the linearly separable (LSIO) data, and does not 
require an inordinate amount of search. Clearly, if the data is linearly separable, one should 
use a method such as LMDT or linear programming. OCl and SADT have difficulty finding 
the linear separator, although in our experiments OCf did eventually find it, given sufficient 
time. 

On the other hand, for both of the non-linearly separable data sets, LMDT produces 
much larger trees that are significantly less accurate than those produced by OCf and 
SADT. Even the deterministic variant of OCf (using zero restarts and zero random jumps) 
outperforms LMDT on these problems, with much less search. 

Although SADT sometimes produces very accurate trees, its main weakness was the 
enormous amount of search time it required, roughly fO-20 times greater than OCf even 
using the 20:20 setting. One explanation of OCf 's advantage is its use of directed search, as 
opposed to the strictly random search used by simulated annealing. Overall, Table 2 shows 
that OCf's use of randomization was quite effective for the non-linearly separable data. 

It is natural to ask why randomization helps OCf in the task of inducing decision trees. 
Researchers in combinatorial optimization have observed that randomized search usually 
succeeds when the search space holds an abundance of good solutions (Gupta, Smolka, 
& Bhaskar, f994). Furthermore, randomization can improve upon deterministic search 
when many of the local maxima in a search space lead to poor solutions. In OCf's search 
space, a local maximum is a hyperplane that cannot be improved by the deterministic 
search procedure, and a "solution" is a complete decision tree. If a significant fraction 
of local maxima lead to bad trees, then algorithms that stop at the first local maximum 
they encounter will perform poorly. Because randomization allows OCf to consider many 
different local maxima, if a modest percentage of these maxima lead to good trees, then it 
has a good chance of finding one of those trees. Our experiments with OCf thus far indicate 
that the space of oblique hyperplanes usually contains numerous local maxima, and that a 
substantial percentage of these locally good hyperplanes lead to good decision trees. 

5. Conclusions and Future Work 

This paper has described OCf, a new system for constructing oblique decision trees. We 
have shown experimentally that OCf can produce good classifiers for a range of real-world 
and artificial domains. We have also shown how the use of randomization improves upon 
the original algorithm proposed by Breiman et al. (f984), without significantly increasing 
the computational cost of the algorithm. 

The use of randomization might also be beneficial for axis-parallel tree methods. Note 
that although they do find the optimal test (with respect to an impurity measure) for each 
node of a tree, the complete tree may not be optimal: as is well known, the problem of 
finding the smallest tree is NP-Complete (Hyafil & Rivest, f976). Thus even axis-parallel 
decision tree methods do not produce "ideal" decision trees. Quinlan has suggested that his 
windowing algorithm might be used as a way of introducing randomization into C4.5, even 
though the algorithm was designed for another purpose (Quinlan, f993a). (The windowing 
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algorithm selects a random subset of the training data and builds a tree using that.) We 
believe that randomization is a powerful tool in the context of decision trees, and our 
experiments are just one example of how it might be exploited. We are in the process of 
conducting further experiments to quantify more accurately the effects of different forms of 
randomization. 

It should be clear that the ability to produce oblique splits at a node broadens the capa- 
bilities of decision tree algorithms, especially as regards domains with numeric attributes. 
Of course, axis-parallel splits are simpler, in the sense that the description of the split only 
uses one attribute at each node. OCX uses oblique splits only when their impurity is less 
than the impurity of the best axis-parallel split; however, one could easily penalize the 
additional complexity of an oblique split further. This remains an open area for further 
research. A more general point is that if the domain is best captured by a tree that uses 
oblique hyperplanes, it is desirable to have a system that can generate that tree. We have 
shown that for some problems, including those used in our experiments, OCX builds small 
decision trees that capture the domain well. 

Appendix A. Complexity Analysis of OCl 

In the following, we show that OCX runs efficiently even in the worst case. For a data 
set with n examples (points) and d attributes per example, OCX uses at most O(dn^logn) 
time. We assume n > d for our analysis. 

For the analysis here, we assume the coefficients of a hyperplane are adjusted in sequen- 
tial order (the Seq method described in the paper). The number of restarts at a node will 
be r, and the number of random jumps tried will be j. Both r and j are constants, fixed in 
advance of running the algorithm. 

Initializing the hyperplane to a random position takes just 0(d) time. We need to 
consider first the maximum amount of work OCX can do before it finds a new location for 
the hyperplane. Then we need to consider how many times it can move the hyperplane. 

X. Attempting to perturb the first coefficient (ai) takes 0(dn-\-nlogn) time. Computing 
C/j's for all the points (equation 2) requires 0(dn) time, and sorting the C/j's takes 
O(ralogra). This gives us 0(dn-\- ralogra) work. 

2. If perturbing ai does not improve things, we try to perturb 02. Computing all the new 
C/j's will take just 0(n) time because only one term is different for each Ui. Re-sorting 
will take O (ralogra), so this step takes 0(n) + O (ralogra) = O (ralogra) time. 

3. Likewise 03, . . . , will each take 0(n log ra) additional time, assuming we still have not 
found a better hyperplane after checking each coefficient. Thus the total time to cycle 
through and attempt to perturb all these additional coefficients is (rf— X)*0(ralogra) = 
0(dn logra). 

4. Summing up, the time to cycle through all coefficients is O(dnlog n)-\-0(dn-\-nlog ra) = 
0(dn logra). 

5. If none of the coefficients improved the split, then we attempt to make up to j random 
jumps. Since j is a constant, we will just consider j = 1 for our analysis. This step 
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involves choosing a random vector and running the perturbation algorithm to solve 
for a, as explained in Section 3.2. As before, we need to compute a set of C/j's and sort 
them, which takes 0(dn + nlogn) time. Because this amount of time is dominated by 
the time to adjust all the coefficients, the total time so far is still O(dnlogn). This is 
the most time OCf can spend at a node before either halting or finding an improved 
hyperplane. 

6. Assuming OCf is using the Sum Minority or Max Minority error measure, it can only 
reduce the impurity of the hyperplane n times. This is clear because each improvement 
means one more example will be correctly classified by the new hyperplane. Thus the 
total amount of work at a node is limited to ra * O(dnlogn) = 0(dn^ log n). (This 
analysis extends, with at most linear cost factors, to Information Gain, Gini Index and 
Twoing Rule when there are two categories. It will not apply to a measure that, for 
example, uses the distances of mis-classified objects to the hyperplane.) In practice, 
we have found that the number of improvements per node is much smaller than n. 

Assuming that OCf only adjusts a hyperplane when it improves the impurity measure, 
it will do O(dn^logn) work in the worst case. 

However, OCf allows a certain number of adjustments to the hyperplane that do not 
improve the impurity, although it will never accept a change that worsens the impurity. 
The number allowed is determined by a constant known as "stagnant-perturbations". Let 
this value be s. This works as follows. 

Each time OCf finds a new hyperplane that improves on the old one, it resets a counter 
to zero. It will move the new hyperplane to a different location that has equal impurity at 
most s times. After each of these moves it repeats the perturbation algorithm. Whenever 
impurity is reduced, it re-starts the counter and again allows s moves to equally good 
locations. Thus it is clear that this feature just increases the worst-case complexity of OCf 
by a constant factor, s. 

Finally, note that the overall cost of OCf is also O(dn^logn), i.e., this is an upper 
bound on the total running time of OCf independent of the size of the tree it ends up 
creating. (This upper bound applies to Sum Minority and Max Minority; an open question 
is whether a similar upper bound can be proven for Information Gain or the Gini Index.) 
Thus the worst-case asymptotic complexity of our system is comparable to that of systems 
that construct axis-parallel decision trees, which have O(dn^) worst-case complexity. To 
sketch the intuition that leads to this bound, let G be the total impurity summed over all 
leaves in a partially constructed tree (i.e., the sum of currently misclassified points in the 
tree). Now observe that each time we run the perturbation algorithm on any node in the 
tree, we either halt or improve G by at least one unit. The worst-case analysis for one node 
is realized when the perturbation algorithm is run once for every one of the n examples, 
but when this happens, there would no longer be any mis-classified examples and the tree 
would be complete. 

Appendix B. Definitions of impurity measures available in OCl 

In addition to the Twoing Rule defined in the text, OCf contains built-in definitions of five 
additional impurity measures, defined as follows. In each of the following definitions, the 
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set of examples T at the node about to be split contains ra (> 0) instances that belong to 
one of A; categories. (Initially this set is the entire training set.) A hyperplane H divides T 
into two non-overlapping subsets Tl and Tr (i.e., left and right). Lj and Rj are the number 
of instances of category j in Tl and Tr respectively. All the impurity measures initially 
check to see if Tl and Tr are homogeneous (i.e., all examples belong to the same category), 
and if so return minimum (zero) impurity. 

Information Gain. This measure of information gained from a particular split was pop- 
ularized in the context of decision trees by Quinlan (1986). Quinlan's definition makes 
information gain a goodness measure; i.e., something to maximize. Because OCX attempts 
to minimize whatever impurity measure it uses, we use the reciprocal of the standard value 
of information gain in the OCX implementation. 

Gini Index. The Gini Criterion (or Index) was proposed for decision trees by Breiman et 
al. (X984). The Gini Index as originally defined measures the probability of misclassification 
of a set of instances, rather than the impurity of a split. We implement the following 
variation: 

k 

GiniL = X.0-^(^/|ri|)2 

8 = 1 

k 

GimR = I. {)-Y,{R^ I \Tr\? 

8 = 1 

Impurity = (\Tl\ * GiniL + \Tr\ * GiniR)/ra 

where GiniL is the Gini Index on the "left" side of the hyperplane and GiniR is that on the 
right . 

Max Minority. The measures Max Minority, Sum Minority and Sum Of Variances were 
defined in the context of decision trees by Heath, Kasif, and Salzberg (X993b).^ Max 
Minority has the theoretical advantage that a tree built minimizing this measure will have 
depth at most log n. Our experiments indicated that this is not a great advantage in 
practice: seldom do other impurity measures produce trees substantially deeper than those 
produced with Max Minority. The definition is: 

k 

MinorityL = ^ Li 

8'=l,8'7^maxLi 

k 

MinorityR = ^ R, 

8 = l,87^maxi?i 

Max Minority = max(MinorityL, MinorityR) 
9. Sum Of Variances was called Sum of Impurities by Heath et al. 
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Sum Minority. This measure is very similar to Max Minority. If MinorityL and Minor- 
ityR are defined as for the Max Minority measure, then Sum Minority is just the sum of 
these two values. This measure is the simplest way of quantifying impurity, as it simply 
counts the number of misclassified instances. 

Though Sum Minority performs well on some domains, it has some obvious fiaws. As 
one example, consider a domain in which n = 100,6? = 1, and k = 2 (i.e., 100 examples, 1 
numeric attribute, 2 classes). Suppose that when the examples are sorted according to the 
single attribute, the first 50 instances belong to category 1, followed by 24 instances of cat- 
egory 2, followed by 26 instances of category 1. Then a// possible splits for this distribution 
have a sum minority of 24. Therefore it is impossible when using Sum Minority to distin- 
guish which split is preferable, although splitting at the alternations between categories is 
clearly better. 

Sum Of Variances. The definition of this measure is: 

\Tl\ \Tl\ 
Variance! = ^ (Cat(TL,) - J2 Cat(TL^ )/\TL\ f 

8=1 J=l 

\Tr\ \Tr\ 
VarianceR = ^ (Cat(TR,) - ^ C at(TR^) / \TR\ f 

8 = 1 J = l 

Sum of Variances = VarianceL -|- VarianceR 

where Cat(rj) is the category of instance Ti. As this measure is computed using the actual 
class labels, it is easy to see that the impurity computed varies depending on how numbers 
are assigned to the classes. For instance, if Ti consists of 10 points of category 1 and 3 
points of category 2, and if T2 consists of 10 points of category 1 and 3 points of category 
5, then the Sum Of Variances values are different for Ti and T2. To avoid this problem, 
OCX uniformly reassigns category numbers according to the frequency of occurrence of each 
category at a node before computing the Sum Of Variances. 

Acknowledgements 

The authors thank Richard Beigel of Yale University for suggesting the idea of jumping in a 
random direction. Thanks to Wray Buntine of Nasa Ames Research Center for providing the 
IND 2.1 package, to Carla Brodley for providing the LMDT code, and to David Heath for 
providing the SADT code and for assisting us in using it. Thanks also to three anonymous 
reviewers for many helpful suggestions. This material is based upon work supported by the 
National Science foundation under Grant Nos. IRI-9116843, IRI-9223591, and IRI-9220960. 

References 

Aha, D. (1990). A Study of Instance-Based Algorithms for Supervised Learning: Mathemat- 
ical, empirical and psychological evaluations. Ph.D. thesis. Department of Information 
and Computer Science, University of California, Irvine. 



27 



MURTHY, KASIF & SaLZBERG 



Almuallin, H., & Dietterich, T. (1991). Learning with many irrelevant features. In Pro- 
ceedings of the Ninth National Conference on Artificial Intelligence, pp. 547-552. San 
Jose, CA. 

Belsley, D. (1980). Regression Diagnostics: Identifying Influential Data and Sources of 
Collinearity. Wiley & Sons, New York. 

Bennett, K., & Mangasarian, 0. (1992). Robust linear programming discrimination of two 
linearly inseparable sets. Optimization Methods and Software, 1, 23-34. 

Bennett, K., & Mangasarian, 0. (1994a). Multicategory discrimination via linear program- 
ming. Optimization Methods and Software, 3, 29-39. 

Bennett, K., & Mangasarian, 0. (1994b). Serial and parallel multicategory discrimination. 
SI AM Journal on Optimization, ^(4). 

Blum, A., & Rivest, R. (1988). Training a 3-node neural network is NP-complete. In Pro- 
ceedings of the 1988 Workshop on Computational Learning Theory, pp. 9-18. Boston, 
MA. Morgan Kaufmann. 

Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression 
Trees. Wadsworth International Group. 

Brent, R. P. (1991). Fast training algorithms for multilayer neural nets. IEEE Transactions 
on Neural Networks, 2(3), 346-354. 

Brodley, C. E., & Utgoff, P. E. (1992). Multivariate versus univariate decision trees. Tech. 
rep. COINS CR 92-8, Dept. of Computer Science, University of Massachusetts at 
Amherst. 

Brodley, C. E., & Utgoff, P. E. (1994). Multivariate decision trees. Machine Learning, to 
appear. 

Buntine, W. (1992). Tree classification software. Technology 2002: The Third National 
Technology Transfer Conference and Exposition. 

Buntine, W., & Niblett, T. (1992). A further comparison of splitting rules for decision-tree 
induction. Machine Learning, 8, 75-85. 

Cardie, C. (1993). Using decision trees to improve case-based learning. In Proceedings of 
the Tenth International Conference on Machine Learning, pp. 25-32. University of 
Massachusetts, Amherst. 

Cestnik, G., Kononenko, I., & Bratko, I. (1987). Assistant 86: A knowledge acquisition 
tool for sophisticated users. In Bratko, I., & Lavrac, N. (Eds.), Progress in Machine 
Learning. Sigma Press. 

Cios, K. J., & Liu, N. (1992). A machine learning method for generation of a neural network 
architecture: A continuous ID3 algorithm. IEEE Transactions on Neural Networks, 
5(2), 280-291. 



28 



Induction of Oblique Decision Trees 



Cohen, W. (1993). Efficient pruning metliods for separate- and-conquer rule learning sys- 
tems. In Proceedings of the 13th International Joint Conference on Artificial Intelli- 
gence, pp. 988-994. Morgan Kaufmann. 

Fayyad, U. M., & Irani, K. B. (1992). The attribute specification problem in decision tree 
generation. In Proceedings of the Tenth National Conference on Artificial Intelligence, 
pp. 104-110. San Jose CA. AAAI Press. 

Frean, M. (1990). Small Nets and Short Paths: Optimising neural computation. Ph.D. 
thesis. Centre for Cognitive Science, University of Edinburgh. 

Gupta, R., Smolka, S., & Bhaskar, S. (1994). On randomization in sequential and distributed 
algorithms. ACM Computing Surveys, 26(1), 7-86. 

Hampson, S., & Volper, D. (1986). Linear function neurons: Structure and training. Bio- 
logical Cybernetics, 53, 203-217. 

Harrison, D., & Rubinfeld, D. (1978). Hedonic prices and the demand for clean air. Journal 
of Environmental Economics and Management, 5, 81-102. 

Hassibi, B., & Stork, D. (1993). Second order derivatives for network pruning: optimal 
brain surgeon. In Advances in Neural Information Processing Systems 5, pp. 164-171. 
Morgan Kaufmann, San Mateo, CA. 

Heath, D. (1992). A Geometric Framework for Machine Learning. Ph.D. thesis, Johns 
Hopkins University, Baltimore, Maryland. 

Heath, D., Kasif, S., & Salzberg, S. (1993a). k-DT: A multi-tree learning method. In 
Proceedings of the Second International Workshop on Multistrategy Learning, pp. 138- 
149. Harpers Ferry, WV. George Mason University. 

Heath, D., Kasif, S., & Salzberg, S. (1993b). Learning oblique decision trees. In Proceedings 
of the 13th International Joint Conference on Artificial Intelligence, pp. 1002-1007. 
Chambery, France. Morgan Kaufmann. 

Herman, G. T., & Yeung, K. D. (1992). On piecewise-linear classification. IEEE Transac- 
tions on Pattern Analysis and Machine Intelligence, 14i^)j 782-786. 

Holte, R. (1993). Very simple classification rules perform well on most commonly used 
datasets. Machine Learning, 11(1), 63-90. 

Hyafil, L., & Rivest, R. L. (1976). Constructing optimal binary decision trees is NP- 
complete. Information Processing Letters, 5(1), 15-17. 

Kira, K., & Rendell, L. (1992). A practical approach to feature selection. In Proceedings 
of the Ninth International Conference on Machine Learning, pp. 249-256. Aberdeen, 
Scotland. Morgan Kaufmann. 

Kirkpatrick, S., Gelatt, C, & Vecci, M. (1983). Optimization by simulated annealing. 
Science, 220(4598), 671-680. 



29 



MURTHY, KASIF & SaLZBERG 



KodratofF, Y., & Manage, M. (1987). Generalization and noise. International Journal of 
Man-Machine Studies, 27, 181-204. 

Langley, P., & Sage, S. (1993). Scaling to domains with many irrelevant features. Learning 
Systems Department, Siemens Corporate Research, Princeton, NJ. 

Mangasarian, 0., Setiono, R., & Wolberg, W. (1990). Pattern recognition via linear pro- 
gramming: Theory and application to medical diagnosis. In SIAM Workshop on 
Optimization. 

Mingers, J. (1989a). An empirical comparison of pruning methods for decision tree induc- 
tion. Machine Learning, ^(2), 227-243. 

Mingers, J. (1989b). An empirical comparison of selection measures for decision tree induc- 
tion. Machine Learning, 3, 319-342. 

Moret, B. M. (1982). Decision trees and diagrams. Computing Surveys, -?-^(4), 593-623. 

Murphy, P., & Aha, D. (1994). UCI repository of machine learning databases - a machine- 
readable data repository. Maintained at the Department of Information and Computer 
Science, University of California, Irvine. Anonymous FTP from ics.uci.edu in the 
directory pub/machine-learning-databases. 

Murthy, S. K., Kasif, S., Salzberg, S., & Beigel, R. (1993). OCl: Randomized induction of 
oblique decision trees. In Proceedings of the Eleventh National Conference on Artificial 
Intelligence, pp. 322-327. Washington, D.C. MIT Press. 

Murthy, S. K., & Salzberg, S. (1994). Using structure to improve decision trees. Tech. rep. 
JIIU-94/12, Department of Computer Science, Johns Hopkins University. 

Niblett, T. (1986). Constructing decision trees in noisy domains. In Bratko, I., & Lavrac, 
N. (Eds.), Progress in Machine Learning. Sigma Press, England. 

Nilsson, N. (1990). Learning Machines. Morgan Kaufmann, San Mateo, CA. 

Odewahn, S., Stockwell, E., Pennington, R., Humphreys, R., & Zumach, W. (1992). Au- 
tomated star-galaxy descrimination with neural networks. Astronomical Journal, 
103(1), 318-331. 

Pagallo, G. (1990). Adaptive Decision Tree Algorithms for Learning From Examples. Ph.D. 
thesis. University of California at Santa Cruz. 

Pagallo, G., & Haussler, D. (1990). Boolean feature discovery in empirical learning. Machine 
Learning, 5(1), 71-99. 

Quinlan, J. R. (1983). Learning efficient classification procedures and their application to 
chess end games. In Michalski, R., Carbonell, J., & Mitchell, T. (Eds.), Machine 
Learning: An Artificial Intelligence Approach. Morgan Kaufmann, San Mateo, CA. 

Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81-106. 



30 



Induction of Oblique Decision Trees 



Quinlan, J. R. (1987). Simplifying decision trees. International Journal of Man-Machine 
Studies, 27, 221-234. 

Quinlan, J. R. (1993a). C4-5: Programs for Machine Learning. Morgan Kaufmann Pub- 
lishers, San Mateo, CA. 

Quinlan, J. R. (1993b). Combining instance-based and model-based learning. In Proceedings 
of the Tenth International Conference on Machine Learning, pp. 236-243 University 
of Massachusetts, Amherst. Morgan Kaufmann. 

Roth, R. H. (1970). An approach to solving linear discrete optimization problems. Journal 
of the ACM, 17(2), 303-313. 

Safavin, S. R., & Landgrebe, D. (1991). A survey of decision tree classifier methodology. 
IEEE Transactions on Systems, Man and Cybernetics, 21(3), 660-674. 

Sahami, M. (1993). Learning non-linearly separable boolean functions with linear thresh- 
old unit trees and madaline- style networks. In Proceedings of the Eleventh National 
Conference on Artificial Intelligence, pp. 335-341. AAAI Press. 

Salzberg, S. (1991). A nearest hyperrectangle learning method. Machine Learning, 6, 
251-276. 

Salzberg, S. (1992). Combining learning and search to create good classifiers. Tech. rep. 
JHU-92/12, Johns Hopkins University, Baltimore MD. 

Salzberg, S., Chandar, R., Ford, H., Murthy, S. K., k White, R. (1994). Decision trees for 
automated identification of cosmic rays in Hubble Space Telescope images. Publica- 
tions of the Astronomical Society of the Pacific, to appear. 

Schaffer, C. (1993). Overfitting avoidance as bias. Machine Learning, 10, 153-178. 

Schlimmer, J. (1993). Efficiently inducing determinations: A complete and systematic 
search algorithm that uses optimal pruning. In Proceedings of the Tenth International 
Conference on Machine Learning, pp. 284-290. Morgan Kaufmann. 

Smith, J., Everhart, J., Dickson, W., Knowler, W., & Johannes, R. (1988). Using the 
ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings 
of the Symposium on Computer Applications and Medical Care, pp. 261-265. IEEE 
Computer Society Press. 

Utgoff, P. E. (1989). Perceptron trees: A case study in hybrid concept representations. 
Connection Science, 1(4:), 377-391. 

Utgoff, P. E., & Brodley, C. E. (1990). An incremental method for finding multivariate 
splits for decision trees. In Proceedings of the Seventh International Conference on 
Machine Learning, pp. 58-65. Los Altos, CA. Morgan Kaufmann. 

Utgoff, P. E., & Brodley, C. E. (1991). Linear machine decision trees. Tech. rep. 10, 
University of Massachusetts at Amherst. 



31 



MURTHY, KASIF & SaLZBERG 



Van de Merckt, T. (1992). NFDT: A system that learns flexible concepts based on decision 
trees for numerical attributes. In Proceedings of the Ninth International Workshop on 
Machine Learning, pp. 322-331. 

Van de Merckt, T. (1993). Decision trees in numerical attribute spaces. In Proceedings of 
the 13th International Joint Conference on Artificial Intelligence, pp. 1016-1021. 

Weiss, S., & Kapouleas, I. (1989). An empirical comparison of pattern recognition, neural 
nets, and machine learning classiflcation methods. In Proceedings of the 11th Inter- 
national Joint Conference of Artificial Intelligence, pp. 781-787. Detroit, MI. Morgan 
Kaufmann. 

Wolpert, D. (1992). On overfltting avoidance as bias. Tech. rep. SFI TR 92-03-5001, The 
Santa Fe Institute, Santa Fe, New Mexico. 



32 



